ABSTRACT

Over the last two decades, scientific workflow management 
systems (SWfMS) have emerged as a means to facilitate 
the design, execution, and monitoring of reusable scientific 
data processing pipelines. At the same time, the amounts of 
data generated in various areas of science outpaced enhance- 
ments in computational power and storage capabilities. This 
is especially true for the life sciences, where new technolo- 
gies increased the sequencing throughput from kilobytes to 
terabytes per day. This trend requires current SWfMS to 
adapt: Native support for parallel workflow execution must 
be provided to increase performance; dynamically scalable 
"pay-per-use" compute infrastructures have to be integrated 
to diminish hardware costs; adaptive scheduling of workflows 
in distributed compute environments is required to optimize 
resource utilization. In this survey we give an overview of 
parallelization techniques for SWfMS, both in theory and in 
their realization in concrete systems. We find that current 
systems leave considerable room for improvement and we 
propose key advancements to the landscape of SWfMS. 
1. INTRODUCTION

Over the last two decades, the scientific community wit- 
nessed the establishment of computation as an integral part 
of research beside the traditional paradigms of theory and 
experiment [1]. Today's scientific experiments typically
involve running and refining a series of intertwined
computational analysis and visualization tasks on large amounts of
data. The complexity of these so-called analysis pipelines,
which results in high costs for development and maintenance,
the need for sharing the knowledge encoded in these pipelines
as well as the hardware to execute them, and the need for
repeatability and rigorous tracing of pipeline runs led to
the emergence of e-Science [2] and scientific workflows [3].

Scientific workflows are compositions of sequential and con- 
current data processing tasks, whose order is determined by 
data interdependencies [4]. A task is the basic data processing
component of a scientific workflow, consuming data from
input files or previous tasks and producing data for follow-up
tasks or output files (see Figure [1]). A scientific workflow
is usually specified in the form of a directed, acyclic graph 
(DAG), in which individual tasks are represented as nodes. 
Scientific workflows exist at different levels of abstraction: 
abstract, concrete, and physical. An abstract workflow mod- 
els data flow as a concatenation of conceptual processing 
steps. Assigning actual methods to abstract tasks results 
in a concrete workflow. If this mapping is performed
automatically, it is called workflow planning [5]. To execute a
concrete workflow, input data and processing tasks have to
be assigned to physical compute resources. In the context of
scientific workflows, this assignment is called scheduling and
results in a physical and executable workflow [6]. Low-level
batch scripts are a typical example of physical workflows. 
See Figures [2] and [3] for two concrete bioinformatics work- 
flows. 




Figure 1: The structure of a single workflow task. A 
task receives input data at its input ports. This data 
is processed according to a certain algorithm (ser- 
vice invocation, program call, etc.) and parameters. 
Generated output data is passed from the output 
ports to follow-up tasks for further processing. 



Deelman et al. [1] identify four phases of the workflow
lifecycle: (1) the design and composition of concrete
workflows; (2) the mapping of concrete workflows to the
underlying physical resources (scheduling); (3) the execution of
physical workflows; (4) the recording of metadata and
provenance information at all stages of the workflow lifecycle.
Scientific workflow management systems (SWfMS) address
this lifecycle or subsets thereof by providing capabilities for
modeling, executing, monitoring and storing scientific
workflows. Most SWfMS operate on concrete workflows by
requiring users to specify compositions of concrete processing
tasks.

Table 1: Most popular publicly available SWfMS

SWfMS languages URL
Taverna Scufl, T2flow taverna.org.uk
Swift
knime.org

Recent years have seen considerable progress in the
development and deployment of SWfMS. Examples of concrete
systems include Taverna [7], Kepler [8], Pegasus [9],
and KNIME [10] (see Table [1]). A comprehensive list of
universal and bioinformatics SWfMS, both commercial and
from the public domain, has been assembled by Tiwari and
Sekhar [11]. To narrow the gap between domain scientists
and application developers, the majority of these SWfMS
involve a graphical user interface with drag and drop
functionality to facilitate the composition of tasks into
workflows [12].

While most SWfMS provide a general-purpose framework 
for workflow enactment, others exhibit a certain affinity for 
a confined field of science and inherently provide domain- 
specific components. For instance, Taverna and KNIME 
provide built-in libraries for the fields of computational bi- 
ology and chemistry, respectively. Further, some systems 
have been designed to cater exclusively to a specific user 
group or scientific domain. Frameworks like Galaxy [13],
Mobyle [14] or Conveyor [15] have been developed solely for 
applications in the life sciences. Clearly, there are trade-offs 
to consider between domain-specific and general-purpose ap- 
proaches. While confinement to a particular field facilitates 
use for domain scientists and may help to promote design 
standards, it requires SWfMS providers to keep built-in com- 
ponents up to date and limits versatility and interoperability 
of workflow design. 

In this work, we distinguish three major classes of SWfMS: 



• Textual workflow languages: This category consists of 
low-level textual languages catered to computer-savvy 
users adept at using batch scripts and programming 
languages. Workflows are specified in the form of often 
complex configuration files, which are interpreted and 
executed by the SWfMS. The Pegasus workflow man- 
agement system [9] and Swift scripting language [16] 
are typical examples for this group of systems. 

• Graphical workflow systems: SWfMS belonging to this 
group put a strong emphasis on ease of use. They 
provide a graphical user interface for workflow design
and execution monitoring as well as a range of general-purpose
and often domain-specific task libraries.
While some of these systems utilize a textual workflow
language for internal representation (see Table [1]), this
language is not intended and designed to be accessed
by the user. Examples include the Taverna [7] and
Kepler [8] workflow systems.

• Domain-specific web portals: This category comprises
online portals, where scientists can design, execute, 
and share workflows within a certain domain. In most 
cases no software has to be installed locally. Workflows 
are composed exclusively from built-in components in 
a web browser and are executed on a public (or private)
server. For instance, Galaxy [13] and Mobyle [14]
can be considered enactment portals for research
in the life sciences.



This categorization was inspired by the work of Romano 
[18], who proposed to subdivide SWfMS in the life sciences
into software libraries, standalone systems, client/server sys- 
tems, and enactment portals. We utilize a slightly differ- 
ent categorization, as we found most standalone SWfMS to 
provide a client/server-based implementation as well and 
hence found it difficult to differentiate between the two. 
Also, we do not consider mere software libraries to qualify
as SWfMS. 

The wide range of SWfMS outlined above provides solutions 
for analysis pipelines of researchers from various domains. 
However, due to ever-increasing amounts of data across all 
fields of science, the computational effort required to exe- 
cute a given scientific workflow is becoming more and more 
critical. In fact, the magnitude of data produced in many 
scientific domains has risen at exponential rates and often 
outpaced advances in storage capacity, network bandwidth, 
and processing power. In bioinformatics, for instance, recent 
years brought a new generation of devices, which produce 
genomic data at unprecedented scale [19]. The generation 
of genomic data was found to double every nine months - 
at a pace much faster than computing power and storage 
capacity (see Figure [4]). 

Besides algorithmic advances, the canonical way to deal with 
increasing data volumes is parallelization. This is true for 
all areas of computer science and is reflected in the devel- 
opment of parallel execution of threads on single chips as 
well as on infrastructures which combine multiple machines 
to clusters, grids, and clouds. Figure [5] displays the
development of the number of cores per CPU, as observed over the
last decades. It showcases the trend towards multicore
architectures in chip design. Figure [6] illustrates the number of
objects stored in Amazon's Simple Storage Service (Amazon
S3). The exponential growth of the largest cloud provider's
data storage solution mirrors the trend of parallel computation
environments such as compute clouds increasing in size
and popularity.

Figure 2: A Taverna [7] bioinformatics workflow. Lists of proteins are retrieved and compared for pathogenic
(disease) and non-pathogenic (healthy) genomes. Proteins unique to the pathogenic genome are located
in KEGG pathways and investigated for the use as potential drug targets. The workflow was published
in the myExperiment [17] workflow repository under the id 1172. Nodes colored in light pink constitute
subworkflows which have been collapsed for easier readability.

Figure 3: A Galaxy [13] workflow for performing a metagenomic analysis on next-generation sequencing data.
A metagenomic analysis compares the genomes of species in one environment to the genomes of species in
another environment to find environment-specific genes. The workflow was published on the public Galaxy
server under the name "metagenomic analysis".

Figure 4: Development of processor speed, HDD
storage capacity and genomic data produced (in
1000 nucleic acids per day). Image taken from [20].
[Plot title: Sequencing Progress vs Compute and Storage; Moore's and Kryder's Laws fall far behind]

While several SWfMS like DAGMan and Pegasus have been
designed with parallel computation on shared resources in
mind, they rarely provide multicore support and are generally
difficult for domain scientists to set up and use. As a
result, there has been little uptake in the scientific community,
and installations of these systems in actual data centers
are few and far between [12]. More recent systems like
Taverna put a stronger emphasis on usability, yet provide
only limited means towards parallelization and utilization
of distributed compute resources. This survey explores the
gap between inherent parallelization and ease of use in current
SWfMS.

To this end, we present an overview of parallelization
techniques for SWfMS. Workflow systems differ strongly in their
support for parallelism and we believe that there is an urgent
need for a comparative survey focusing on this aspect.
We hope that this will serve as an entry point for both domain
scientists with data-intensive workflows at hand and
SWfMS researchers with an interest in parallel computing.
The three main contributions of this survey are:

• A taxonomy for approaches towards parallelism in scientific
workflow management systems. This taxonomy
focuses on concepts which are treated differently by 
many SWfMS and can arguably be improved upon. At 
the same time, the taxonomy omits aspects which are 
either not related to parallelization or don't serve as 
distinctive features since they are implemented identi- 
cally in most SWfMS. 

• A comparative overview of parallelization techniques 
and computational infrastructures supported by cur- 
rent SWfMS. We believe that such a contrasting jux- 
taposition can serve both as a reference for researchers 
and as a starting point for scientific workflow users 
with large amounts of scientific data at hand. 

• An outline of current trends along with a discussion of 
future research questions that could leverage scientific 
workflows as the standard model of computation for 
parallel execution of computationally intensive
analysis pipelines.



Figure 5: Development of the number of cores per
chip over the last decades. Image taken from [21].

Figure 6: Total number of objects stored in the
Amazon Simple Storage Service (Amazon S3) since
its introduction in 2006. Image has been published
on the Amazon Web Services Blog in April 2012^1.



To the best of our knowledge, an overview of parallelization 
techniques for SWfMS in the era of cloud computing has 
not been published yet. Perhaps the most similar work to 
ours was conducted in 2006 by Yu and Buyya [22],
who presented a comprehensive taxonomy of SWfMS for 
grid computing. Their taxonomy is comprised of close to 
a hundred terms and is strongly connected to execution on 
a grid infrastructure. In this survey, we chose to employ 
a more general, light-weight taxonomy and discuss current 
SWfMS running on a wider range of computational infras- 
tructures. 

The rest of the paper is organized in the following way.
Fundamental concepts of parallelism in scientific workflows are
outlined in Section [2]. This encompasses types of parallelism,
parallel computation infrastructures, and scheduling
techniques. Concrete realizations of parallelism in current
scientific workflow management systems are described in
Section [3]. We then discuss arising research questions and
summarize in Section [4].

1 http://tinyurl.com/6uh8n24



2. ASPECTS OF PARALLEL SCIENTIFIC 
WORKFLOW EXECUTION 

Scientific workflow systems vary greatly in their means to 
accomplish parallel computation. In this section, we outline 
and contrast fundamental strategies towards parallelization. 
We differentiate between basic types of parallelism, parallel 
computation infrastructures, and different scheduling policies.
We will later adopt these categories to compare different
realizations of parallelism in concrete SWfMS. A graphical
overview of our taxonomy is given in Figure [7].

Figure 7: A taxonomy on the most important aspects
of parallelization in scientific workflows.

We highlight the introduced concepts on the basis of an ex- 
emplary workflow from the field of bioinformatics. To this 
end, we use a workflow described by Li and Leal [23].
Genomic sequencing aims at revealing the ordered sequence
of nucleic acids (DNA) in a given sample, such as a human
chromosome. To date, this can only be achieved by splitting
the DNA sequence into many short overlapping pieces,
which are called reads.

The workflow in Figure [8] processes genomic sequencing data 
in the form of such reads. It requires as input two sets 
of reads from different environments, such as healthy and 
pathogenic (disease) tissue. In the first step, which is re- 
ferred to as reference alignment, these reads are mapped 
to a different, much larger reference genome. The aligned 
reads are then compared to the reference genome in detail 
to detect variants, which serve as indicators for mutations. 
Characteristics of detected variants such as rarity are obtained
from external databases like dbSNP^2. In the final
step, the obtained sets of variants are compared between
healthy and pathogenic tissue. Mutations specific to the
pathogenic genotype might be related to the disease. The
associated gene could therefore qualify as a drug target [23].



2 http://www.ncbi.nlm.nih.gov/projects/SNP/



The workflow displayed in Figure [8] is an abstract work- 
flow. The data processing steps reference alignment, vari- 
ant calling, and disease-gene association, constitute abstract 
concepts for which concrete algorithms have not been se- 
lected yet. In the following section, we will use this work- 
flow as a reference to showcase different parallelization tech- 
niques. 

2.1 Types of parallelism 

A scientific workflow describes the processing of data by a 
set of tasks. Parallelization addresses the question of how 
to distribute the workload associated with a given work- 
flow on several compute nodes. There are different means 
to accomplish parallelization, all of which involve subdivid- 
ing either the set of workflow tasks or the input data (or 
both). An important concept in this context is the degree of
parallelism, which we define as the number of concurrently
running machines or threads at any given time and which
can vary for a given workflow depending on the utilized type
of parallelism.
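
To make this definition concrete, the following minimal sketch computes the peak degree of parallelism of an execution trace. It assumes that the trace is available as a list of (start, end) time intervals, one per task execution; this representation and the function name are illustrative assumptions and are not taken from any particular SWfMS.

```python
def max_degree_of_parallelism(intervals):
    """Return the largest number of task executions running at the same time.

    intervals: iterable of (start, end) pairs, one per task execution
    (a hypothetical trace format chosen for this sketch).
    """
    events = []
    for start, end in intervals:
        events.append((start, 1))   # a task starts: one more busy resource
        events.append((end, -1))    # a task ends: one resource is released
    # Tuples sort by time first; at equal times the -1 event comes first,
    # so a task starting exactly when another ends does not count as overlap.
    events.sort()
    current = peak = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak
```

Feeding in a trace with two resources that each process a chain of tasks back to back yields a peak of two, regardless of how many tasks the workflow contains in total.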

In this section, we distinguish three major types of
parallelization [24]: task, data, and pipeline parallelism. We
describe the characteristics of and relations between these
three approaches to parallelism. We discuss prerequisites
that have to be met, problems that can occur, and the degree
of parallelism that can be achieved. All concepts we
discuss apply equally well to the level of multiple machines 
or multiple threads on a single machine. For ease of exposi- 
tion, in the following discussion we shall focus on the former 
case. 

Task parallelism 

Task parallelism is achieved when the tasks composing a 
workflow are distributed over several independent compute 
nodes. It is only applicable to tasks located on parallel 
branches of the workflow. Data dependencies and the overall 
number of components within the workflow graph strongly 
limit opportunities for task parallelism. The maximum num- 
ber of parallel tasks can be computed easily in advance 
by analyzing the workflow structure. However, choosing 
the right scheduling strategy is not trivial, as parallel tasks 
might exhibit varying runtimes. Differences in task runtimes 
are also the reason for junctions in the data flow being diffi- 
cult to handle: tasks requiring input data from several con- 
current parent tasks have to wait until all parent tasks have 
finished execution. Buffering of intermediate results or syn- 
chronization of task execution is therefore required for task 
parallelism. 

A major advantage of task parallelism is that existing work- 
flows need not be adjusted for tasks to be run concurrently, 
because the graph structure of the workflow alone provides 
sufficient information. This makes task parallelism fairly 
easy to implement. With respect to the example introduced 
in Figure [8], the tasks of reference alignment, variant calling,
and variant characterization are suitable for parallel execu- 
tion, as they lie on parallel branches of the workflow and 
don't have any data dependencies in between (see Figure [9]). 
Evidently, the achievable degree of parallelism is fairly lim- 
ited for workflows which are mostly comprised of sequential 
tasks. 
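
The following sketch illustrates how the graph structure alone can drive task-parallel execution, including the synchronization at junctions. It relies on Python's standard graphlib (Python 3.9 or later) and concurrent.futures modules. The task names, the dictionary encoding of the workflow from Figure [8], and the placeholder run_task function are assumptions made for illustration, and threads merely stand in for compute nodes.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter

# Hypothetical encoding: each task maps to the set of tasks it depends on.
WORKFLOW = {
    "align_healthy": set(),
    "align_disease": set(),
    "call_healthy": {"align_healthy"},
    "call_disease": {"align_disease"},
    "characterize_healthy": {"call_healthy"},
    "characterize_disease": {"call_disease"},
    "associate": {"characterize_healthy", "characterize_disease"},
}


def run_task(name):
    # Placeholder for the real program call or service invocation.
    return name


def run_task_parallel(workflow, max_workers=2):
    """Run every task as soon as all of its parent tasks have finished."""
    sorter = TopologicalSorter(workflow)
    sorter.prepare()
    finished = []
    running = set()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # Submit all tasks whose data dependencies are satisfied.
            for task in sorter.get_ready():
                running.add(pool.submit(run_task, task))
            # Block until at least one running task completes.
            done, running = wait(running, return_when=FIRST_COMPLETED)
            for future in done:
                name = future.result()
                finished.append(name)
                sorter.done(name)  # may unlock child tasks
    return finished
```

With two workers, the healthy and disease branches proceed concurrently, while the final association task only starts once both characterization tasks have reported completion.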
Figure 8: A bioinformatics workflow which processes genomic sequencing data [23]. In reference alignment,
two sets of DNA reads — from a disease and healthy sample respectively — are mapped onto an established 
(and different) reference genome. Alignments are investigated for variants: mismatching nucleic acids, which 
might be indicative of mutations. Variant sets are then characterized and compared between the two samples. 
Mutations specific to the disease sample might be functionally associated to the disease. The crossed out data 
dependency link between the reference alignment and variant calling tasks indicates variant calling being a 
pipeline blocker, i.e., it can't commence until reference alignment has finished processing all of its data. 
Figure 9: Task parallelism illustrated by the bioinformatics workflow from Figure [8]. The separated horizontal
lanes correspond to different compute resources processing different tasks in the course of time. The reference
alignment, variant calling and variant characterization tasks can be run independently of one another. In this
instance they are run task parallel on the compute resources A and B, resulting in a degree of parallelism of
two.



Data parallelism 

In data parallelism, input or intermediate data is split into 
distinct chunks, each of which is processed on a different 
compute node. This means that the workflow - or a part 
thereof - is replicated on each compute node for a different 
fragment of the data. Depending on the granularity of data 
(i.e., how many chunks the data can be split into), very high 
degrees of parallelism are achievable. 

Data parallelism is only feasible if data can be split into in- 
dependent chunks. Most suitable for data parallelism are so- 
called embarrassingly parallel problems in which data items 
can be processed independently from each other. In the 
bioinformatics workflow introduced in Figure [8], the reference
alignment task is an embarrassingly parallel problem
because every read can be mapped to the reference indepen- 
dently of all others. In contrast, partitioning the data in 
variant calling is much more complex since all reads over- 
lapping a position in the reference have to be considered 
jointly to tell true variants from noise in single reads. See 
Figure [10] for a possible implementation of data parallelism 
in the bioinformatics workflow introduced earlier. 

In general, we distinguish three techniques for achieving data 
parallelism: (1) data can be manually split and distributed 
among several replicate processing tasks, effectively exploit- 
ing concepts of task parallelism; (2) scientific workflows can 
be specified ad hoc in a language inherently supporting data 
parallelism, such as languages based on the MapReduce
programming paradigm [25]; (3) computationally demanding
tasks within a workflow can be annotated post hoc with 
metadata describing how data is to be processed in parallel. 
All of these methods require either additional expenditures 
from the user or extended functionality from the SWfMS to 
exploit data parallelism. Also, the techniques vary strongly 
in effort to set up and potential for parallel execution. 
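
As an illustration of the first technique, the sketch below splits a set of reads into chunks, processes the chunks concurrently, and merges the partial results. The toy align_chunk function (an exact substring search), the reference string, and the chunking helper are assumptions made purely for illustration; a thread pool stands in for a set of compute nodes.

```python
from concurrent.futures import ThreadPoolExecutor

REFERENCE = "ACGTACGGTCAAGT"  # made-up reference sequence


def align_chunk(reads):
    # Toy stand-in for a read aligner: exact substring search per read.
    # Each read is handled independently of all other reads.
    return [(read, REFERENCE.find(read)) for read in reads]


def split(items, n_chunks):
    # Ceiling division keeps the number of chunks at or below n_chunks.
    size = max(1, -(-len(items) // n_chunks))
    return [items[i:i + size] for i in range(0, len(items), size)]


def align_data_parallel(reads, n_chunks=3):
    chunks = split(reads, n_chunks)
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        # map() returns results in chunk order, which makes merging trivial.
        partial_results = list(pool.map(align_chunk, chunks))
    return [hit for chunk in partial_results for hit in chunk]
```

Because every read is mapped independently, the merged result is identical to a sequential run. A task such as variant calling would need a more careful split, for example by region of the reference, so that all reads overlapping a position end up in the same chunk.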
Pipeline parallelism

In pipeline parallelism, sequential steps of data processing 
are executed simultaneously on different parts of the input 
data. Thus, partial output data produced by workflow tasks 
is passed to follow-up tasks for immediate consumption - 
similar to workers on an assembly line. Pipeline parallelism 
shares characteristics with task and data parallelism and can 
be considered a subset of both: 



• Similar to task parallelism, the tasks composing a sci- 
entific workflow are distributed over independent com- 
pute nodes. However, this distribution is not limited 
to tasks on parallel branches of the workflow, resulting 
in a potentially higher degree of parallelism. 

• As with data parallelism, input data is fragmented and 
processed independently on different compute nodes. 
However, in contrast to data parallelism, the chrono- 
logical order in which fragments of data are processed 
as well as the assignment of tasks to compute nodes are 
determined in advance and restricted by the concept 
of the pipeline. 

• At its finest granularity, pipeline parallelism is closely 
related to streaming [24].



Figure 11: Pipeline parallel execution of the workflow
from Figure [8]. Again, horizontal lanes correspond
to different compute resources. There is a
synchronization point between reference alignment 
and variant calling, since variant calling is a pipeline 
blocker. Subsequent to reference alignment, dif- 
ferent chunks of the alignment data are located 
and processed at different stages of the processing 
pipeline. While the first fragment of data already 
undergoes disease-gene association, variants in the
second fragment are characterized, and the third 
fragment is still investigated for variants. All of 
these tasks are executed in parallel on different com- 
pute resources. If the tasks vary in their computa- 
tional cost, this assignment of resources certainly is 
not optimal. Additional resources could be assigned 
to slower tasks, allowing for a more fine-grained bal- 
ancing of workload among resources. 
Figure 10: A possible implementation of data parallelism in the workflow introduced in Figure [8]. Again,
horizontal lanes correspond to different compute resources. The alignment task is embarrassingly parallel 
and can be replicated for different fragments of input data. Since variant calling is a pipeline blocker, all the 
alignments have to be merged before further processing can commence. Subsequent to this merger, variant 
calling, variant characterization, and disease-gene association are performed data parallel on fragments of
the alignment. Since these tasks are not embarrassingly parallel, additional effort has to be put into how to 
split and merge input and output files. 



One common problem especially in scientific applications 
emerges if tasks composing a sequential workflow vary in 
their computational cost. In order to execute sequential 
tasks with different runtime in parallel, advanced scheduling 
and buffering of intermediate results is essential. Another 
problem arises if a task requires input data to be present 
as a whole before it is able to start execution (a so-called 
pipeline blocker). A prominent example of such a task is 
sorting. 

Figure [11] illustrates pipeline parallel execution of the workflow
introduced in Figure [8]. After some execution time has
passed, different fragments of data are located at different 
stages of the processing pipeline, processed by different com- 
pute resources. In this example, all tasks are assigned their 
own compute node. This certainly isn't the optimal assign- 
ment if the tasks vary strongly in their computational cost. 
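
A minimal sketch of this execution model is given below. Each stage runs in its own thread (standing in for a compute node) and forwards every fragment to the next stage as soon as it has been processed. The queue-based wiring, the sentinel object, and all function names are assumptions made for illustration rather than the mechanism of any specific SWfMS.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of the data stream


def stage(func, inbox, outbox):
    # Process each fragment as soon as it arrives and pass it downstream.
    while True:
        item = inbox.get()
        if item is DONE:
            outbox.put(DONE)  # propagate the end-of-stream marker
            return
        outbox.put(func(item))


def run_pipeline(fragments, funcs):
    """Stream fragments through the sequence of stage functions."""
    # One queue in front of every stage, plus one for the final output.
    queues = [queue.Queue() for _ in range(len(funcs) + 1)]
    threads = [
        threading.Thread(target=stage, args=(func, queues[i], queues[i + 1]))
        for i, func in enumerate(funcs)
    ]
    for thread in threads:
        thread.start()
    for fragment in fragments:
        queues[0].put(fragment)
    queues[0].put(DONE)
    results = []
    while True:
        item = queues[-1].get()
        if item is DONE:
            break
        results.append(item)
    for thread in threads:
        thread.join()
    return results
```

The unbounded queues act as buffers between stages of differing speed. A pipeline blocker such as sorting would not fit the stage function above, since it has to collect all fragments up to the sentinel before emitting its first output.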

Hybrid schemes 

Task, data, and pipeline parallelism are not antagonistic by 
any means and can be combined for increased effect. In the 
case of the bioinformatics workflow introduced in Figure [8],
reference alignment can be performed for different fragments 
of input data on distinct compute resources. The subse- 
quent tasks of variant calling and characterization as well as 
disease-gene association can then be realized as a data
processing pipeline. This realization of the workflow therefore
utilizes all three types of parallelism (see Figure [12]).



All of the introduced types of parallelism have the aim to 
split the workload associated with a scientific workflow, yet 
they differ in the means by which they accomplish this goal. 
Clearly, the abundance of parallelization techniques compli- 
cates the act of scientific workflow scheduling. Approaches 
towards scheduling of scientific workflows will be outlined in 
Section [2.3].



2.2 Computational infrastructures for paral- 
lel processing 

Realizations of parallel workflow execution are usually de- 
signed to run in a particular computation environment. Gen- 
erally, one can differentiate between three settings: local 
cluster, compute grid, and compute cloud. In this section, 
we give a short summary of these compute infrastructures, 
which is based on the definitions in [26, 27] and the work
of Foster et al. [28]. 

We define a compute cluster as a set of tightly connected 
computers which operate as a single system. Ever-increasing 
numbers of cores per CPU and CPUs per cluster (see
Figure [5]) have led to a strong potential for parallelization.
However, up-front acquisition costs are fairly high, which 
is problematic if compute resources are only required when 
new experimental data is available for analysis. For instance, 
bioinformatics workflows such as the one illustrated in Fig- 
ure [8] process genomic data, which is typically generated 
infrequently yet requires substantial computational effort to 
process. 

The sporadic need for the processing capabilities of large 
clusters ultimately resulted in increased efforts of resource 
sharing by the scientific communities. In the early 1990s, 
grid computing was promoted as a new paradigm of distributed
computing in which compute resources could be
heterogeneous and geographically far away from the client.
The idea was to connect compute resources of different
proprietors in order to solve computationally demanding
problems without the need for a supercomputer. Several SWfMS,
such as Pegasus [9] or Condor DAGMan [29, 30], have been
developed to utilize grid resources for parallel execution of
computationally intensive workflows.

Figure 12: Implementation of the workflow from
Figure [8] utilizing task, data and pipeline parallelism.
Again, horizontal lanes correspond to different compute
resources. While reference alignment is performed
concurrently on different subsets of input
data, the following tasks are executed in the form of
a data processing pipeline.

Cloud computing describes a more recently established form 
of distributed computing, which provides rentable compute 
and storage resources on demand over the Internet [27].
The "pay-per-use" cost model of commercial cloud providers 
charges users by the hour and is therefore especially inter- 
esting in environments where data analysis is infrequent yet 
computationally intensive. Resources are generally made
available at different levels of abstraction, namely infrastructure
as a service (IaaS, e.g., Amazon Web Services^3), platform
as a service (PaaS, e.g., Microsoft Windows Azure^4,
Google App Engine^5), and software as a service (SaaS, e.g.,
Google Apps^6, Galaxy [13]) [28].
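
The interplay between hourly billing and the degree of parallelism can be illustrated with a small calculation. The sketch below assumes perfectly divisible work, identical machines, and billing per started machine hour; these simplifications, the function name, and the numbers in the usage note are illustrative assumptions and do not reflect the pricing of any actual provider.

```python
import math


def rental_cost(total_work_hours, n_machines, price_per_hour):
    """Cost of renting n_machines until the evenly divided work is done."""
    hours_per_machine = total_work_hours / n_machines
    # Every started hour is charged in full on every machine.
    billed_hours = math.ceil(hours_per_machine)
    return n_machines * billed_hours * price_per_hour
```

Under these assumptions, five hours of work cost five machine hours on one machine or on five machines, but ten machine hours on ten machines, since each of the ten machines is billed for a full hour while being busy for only thirty minutes.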

In SaaS, concrete applications are hosted in cloud infras- 
tructure and accessed from the client via specific interfaces. 
For instance, Galaxy [13] and Mobyle [14] are two SWfMS 
for the life sciences, which can be accessed by users from a 
web browser and which execute workflows on shared com- 
putational infrastructure. In PaaS and IaaS, resources are 
provisioned in the form of virtual machines including a cus- 
tom operating system and an ephemeral disk, which allows 
storage of intermediate results for as long as the virtual ma- 
chine is leased. The fundamental difference between PaaS 
and IaaS lies in the fact that in IaaS the user has more con- 
trol over the runtime environment and middleware, which 
in PaaS are managed by the cloud provider. 

Many SWfMS have been extended to allow the utilization 
of IaaS and PaaS cloud resources (e.g., Pegasus [31-33],
Swift [E]), while others especially in the life sciences have 
been developed particularly with the cloud environment in 
mind (e.g., [34], [35]). Elasticity, which denotes the possibil- 
ity to provision additional resources at runtime, constitutes 
a key benefit of cloud computing, yet has not been utilized 
to full capacity by SWfMS. 

Despite commercial cloud vendors providing guarantees with 
regards to processor clock speed and memory capacity, the 
actual performance of rented virtual machines varies greatly 
depending on the configuration of underlying hardware and 
utilization of shared resources by other users. In Amazon 
EC2, Dejun et al. [36] observed response times of CPU- and 
I/O-intensive web applications to vary by a factor of four and 
two, respectively. Jackson et al. [37] found network communication
between virtual machines to vary by a factor of up
to 1.7 due to sharing of network resources. Zaharia et al. 
[38] reported I/O performance to vary by a factor of up to 
2.7, depending on how many virtual machines performed 
I/O operations on the same physical hardware. Apparently, 
compute clouds are far more heterogeneous and dynamic 
than commonly perceived. For SWfMS, this finding trans- 
lates into an elevated importance of adaptivity in scheduling, 
as outlined in the next section. 

While it is difficult to conduct a comprehensive comparison 
of scientific workflow execution on cluster, grid, and cloud 
infrastructures, Hoffa et al. [31] contrasted execution of the 
Montage workflow [39] in Pegasus on a local machine as well
as a remote grid and cloud infrastructure with up to four uti- 
lized CPUs of comparable performance. Montage consists of 
a very large number of tasks with a runtime on the order of 
a few seconds. In the grid and cloud environments, Hoffa
et al. observed substantial wide-area data transfer and delays
in the instantiation of large numbers of short tasks, leading
to performance degradation.



3 http://aws.amazon.com
4 http://www.windowsazure.com
5 https://developers.google.com/appengine
6 http://www.google.com/apps



Clearly, the execution of scientific workflows presents differ- 
ent challenges depending on the underlying computational 
infrastructure. Since computational resources in clusters are 
tightly coupled, data locality is less of an issue compared to 
distributed environments, such as grids and clouds. Com- 
pute clusters also provide a more homogeneous environment 
in terms of CPU performance and latency / bandwidth be- 
tween compute nodes. However, scalability of compute clus- 
ters is limited and up-front investment costs are arguably 
high. 

2.3 Workflow scheduling 

Scheduling a scientific workflow involves mapping concrete 
tasks to the available physical resources [40]. This usually
involves optimizing a cost function which can incorporate 
estimates for task runtimes and compute resource perfor- 
mance. More sophisticated schedulers may also include net- 
work data transfer in their runtime assessment (e.g., [41]). 
While not examined in this work, in certain applications it 
might be preferable to optimize monetary cost investment or 
maximize data security, as opposed to minimizing workflow 
execution time [22].

Since all possible placements of tasks on compute resources 
have to be considered, scheduling scientific workflows is NP-complete
in the number of workflow tasks [42] (by reduction
from Minimum Multiprocessor Scheduling [43]). Scheduling
is therefore typically approached heuristically, for instance 
by restricting the scheduling perimeter (i.e., the maximum 
number of tasks considered for scheduling at the same time) . 
The processing of large data volumes along with the exis- 
tence of data dependencies between tasks separate scientific 
workflow scheduling from traditional scheduling in operating 
system design [44].

For the task of workflow scheduling, it is highly beneficial to 
have estimates on the runtimes of individual components, 
the performance of different compute nodes and the data 
transfer rates between compute nodes. The accuracy of task 
runtime estimates is known to largely affect scheduling and 
overall workflow completion time [45, 46]. Approaches to
runtime estimation can be separated into three major groups:

• Empirical models (e.g., [47]) treat tasks as black boxes 
and model their behavior based on past performance. 
Since most scientific workflow management systems 
provide native support for capturing provenance data, 
performance prediction based on previous runs appears 
to be a promising technique. 

• Analytical models (e.g., [48]) employ historical perfor- 
mance data, yet attempt to describe the mechanics 
underlying a task as a composition of mathematical 
functions. 

• Simulation techniques (e.g., [40]) such as sampling re- 
quire a small subset of actual (or artificial) input data 
to be distributed among available resources along with 
necessary executables. The time required for task ex- 
ecution and data transfer is captured and serves as 
an initial assessment of the compute resources' perfor- 
mance. 
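
As a minimal sketch of the empirical approach, the following Python fragment averages previously observed runtimes per task type and compute resource, as they might be extracted from provenance records. The class `EmpiricalRuntimeModel` and its method names are hypothetical and serve only as an illustration; they do not correspond to any of the systems cited above.

```python
from collections import defaultdict
from statistics import mean


class EmpiricalRuntimeModel:
    """Black-box runtime predictor based on past observations (illustrative only)."""

    def __init__(self):
        # (task type, resource) -> list of observed runtimes
        self._history = defaultdict(list)

    def record(self, task_type, resource, runtime):
        """Store one observed runtime, e.g. taken from a provenance trace."""
        self._history[(task_type, resource)].append(runtime)

    def estimate(self, task_type, resource, default=None):
        """Return the mean observed runtime, or `default` if nothing is known."""
        observations = self._history.get((task_type, resource))
        if not observations:
            return default
        return mean(observations)


model = EmpiricalRuntimeModel()
model.record("variant_calling", "A", 10.0)
model.record("variant_calling", "A", 14.0)
print(model.estimate("variant_calling", "A"))      # mean of the two observations
print(model.estimate("reference_alignment", "B"))  # no history yet, prints None
```

A scheduler could fall back to the `default` value, or to a sampling run as in the simulation approach, whenever no history is available for a given pair.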



In the context of scientific workflow scheduling, performance 
instability on possibly shared computational infrastructure 
as well as dynamically changing sets of available compute 
resources can be problematic if schedules are generated in 
advance or schedulers are oblivious to changes in the com- 
putational infrastructure. Adaptive scheduling denotes the 
ability to adjust workflow execution to a dynamically chang- 
ing compute infrastructure at runtime with the aim of min- 
imizing time to workflow completion. One can differentiate 
between several levels of adaptivity in scientific workflow 
scheduling [22]: static, job queue, and adaptive scheduling. 

In the following, we introduce these concepts using the ex- 
ample of the bioinformatics workflow introduced earlier and 
executed data parallel, as illustrated in Figure 10. Figure 13
contains the same workflow along with arbitrary numbers for 
estimated and actual runtimes of every task category (refer- 
ence alignment, variant calling, variant characterization, and 
disease-gene association) on each of three available compute
resources A, B, and C. Different underlying hardware
and network infrastructures result in the compute resources 
exhibiting a different runtime behavior when assigned cer- 
tain tasks. For instance, compute resource B has a com- 
parably short runtime for variant characterization (possibly 
due to a favorable network connection), whereas it exhibits 
lackluster performance for variant calling. 

Static scheduling 

In static scheduling, schedules are assembled prior to workflow
execution and strictly adhered to at runtime. While this
method can yield good results in controllable or homoge- 
neous compute environments, variations in resource per- 
formance can strongly impair overall execution time. Ex- 
amples for static scheduling strategies have been presented 
in HOI HSJ S9] . Pegasus is an example for a SWfMS imple- 
menting a static scheduling scheme [9]. 



Figure 14 (a) illustrates a static schedule assembled according
to the Heterogeneous Earliest Finish Time (HEFT)
scheduling heuristic [41]. In HEFT, the workflow graph is
traversed from the end to the beginning. Estimated times to
finish workflow execution are computed at each task node, 
taking into account expected times for task execution and 
data transfer. To minimize overall time to completion, tasks 
with highest expected time to overall workflow completion 
are mapped onto faster resources. Replacing runtime estimates
with actual task execution times, as listed in Figure 13,
results in the execution trace illustrated in Figure 14 (b),
with an overall execution time of 34. Note that data
transfer times are not considered in this example.
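
The following Python sketch illustrates the two phases described above under simplifying assumptions: data transfer times are ignored (as in the example), all runtimes are assumed to be positive, and tasks are appended to the end of a resource's timeline rather than inserted into idle gaps. The function names and the two-task example are hypothetical and do not reproduce the workflow of Figure 13.

```python
def upward_ranks(successors, cost):
    """Rank of a task = its mean estimated runtime + the largest rank among its successors."""
    ranks = {}

    def rank(task):
        if task not in ranks:
            avg = sum(cost[task].values()) / len(cost[task])
            ranks[task] = avg + max(
                (rank(s) for s in successors.get(task, [])), default=0.0
            )
        return ranks[task]

    for task in cost:
        rank(task)
    return ranks


def heft_schedule(successors, cost):
    """Map tasks, in order of decreasing rank, to the resource with the earliest finish time."""
    ranks = upward_ranks(successors, cost)
    predecessors = {t: [] for t in cost}
    for task, succs in successors.items():
        for s in succs:
            predecessors[s].append(task)

    resource_free = {}  # resource -> time at which it becomes idle
    finish, placement = {}, {}
    # With positive runtimes, a task always outranks its successors,
    # so every predecessor is placed before the task itself.
    for task in sorted(cost, key=ranks.get, reverse=True):
        ready = max((finish[p] for p in predecessors[task]), default=0.0)
        best = min(
            cost[task],
            key=lambda r: max(ready, resource_free.get(r, 0.0)) + cost[task][r],
        )
        start = max(ready, resource_free.get(best, 0.0))
        finish[task] = start + cost[task][best]
        resource_free[best] = finish[task]
        placement[task] = best
    return placement, finish


# Hypothetical two-task chain with estimated runtimes per resource.
successors = {"align": ["call"], "call": []}
cost = {"align": {"A": 4, "B": 2}, "call": {"A": 3, "B": 6}}
placement, finish = heft_schedule(successors, cost)
print(placement, finish)
```

Because the schedule is computed from estimates only, its quality at runtime depends entirely on how well those estimates match the actual execution times.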

Job queue scheduling 

Job queue scheduling encompasses methods which assign
tasks to compute resources in a first-come-first-served manner
at runtime. The scheduler is oblivious to both the runtime
statistics of the distributed compute infrastructure and
the characteristics and requirements of individual
workflow tasks. Implementations of job queue scheduling in
SWfMS can be found in [50-52].

Figure 14 (c) shows the execution trace for greedy job queue
scheduling of the workflow from Figure 13. Here, tasks are
put into a FIFO queue as soon as they are ready to execute.
Idle compute resources extract tasks from this queue. This
results in a total execution time of 30.
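
Greedy job queue scheduling can be sketched as a small event-driven simulation. The Python code below is an illustration under stated assumptions, not the implementation of any cited system: the workflow is assumed to be acyclic, data transfer is ignored, and the function name `job_queue_makespan` as well as the three-task example are hypothetical.

```python
import heapq
from collections import deque


def job_queue_makespan(predecessors, runtime, resources):
    """Simulate FIFO scheduling; returns (total execution time, task placement)."""
    remaining = {t: set(p) for t, p in predecessors.items()}
    queue = deque(t for t, p in remaining.items() if not p)  # ready tasks
    idle = deque(resources)
    running = []  # heap of (finish time, task, resource)
    placement = {}
    now, done = 0.0, 0
    while done < len(predecessors):
        # Idle resources take ready tasks in first-come-first-served order.
        while queue and idle:
            task, res = queue.popleft(), idle.popleft()
            placement[task] = res
            heapq.heappush(running, (now + runtime[task][res], task, res))
        # Advance the clock to the next task completion.
        now, finished, res = heapq.heappop(running)
        done += 1
        idle.append(res)
        # Tasks whose last missing predecessor just finished become ready.
        for task, preds in remaining.items():
            if finished in preds:
                preds.discard(finished)
                if not preds:
                    queue.append(task)
    return now, placement


predecessors = {"a": [], "b": [], "c": ["a", "b"]}
runtime = {
    "a": {"A": 2, "B": 3},
    "b": {"A": 4, "B": 1},
    "c": {"A": 5, "B": 5},
}
makespan, placement = job_queue_makespan(predecessors, runtime, ["A", "B"])
print(makespan, placement)
```

Note that the scheduler never consults `runtime` when choosing a resource; the runtimes only determine when a resource becomes idle again.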

Adaptive scheduling 

An adaptive or dynamic scheduler actively monitors the
computational infrastructure. It then adjusts workflow execution
at runtime according to observed changes, either by
re-scheduling a previously assembled schedule (as proposed
by [53-55]) or by suspending resource assignment until a
task is ready to execute (as implemented in DAGMan and
Swift [30, 56]).

A fully adaptive scheduler maps tasks onto suitable com- 
pute resources at runtime according to not only the current 
performance statistics of the resource, but also the specific 
requirements and characteristics of the task. For instance, a 
fully adaptive scheduler would attempt to assign an I/O- 
intensive task to a compute resource with above-average 
I/O-throughput. Clearly, reorganization of task execution 
at runtime comes at a cost, so there is a tradeoff to con- 
sider. 

The possible benefits of fully adaptive scheduling are illus- 
trated in Figure 14 (d). Each compute resource keeps track 
of its execution times for certain tasks and compares their 
performance to other resources. Whenever it finishes the execution
of a task, it chooses to execute a new task for which
it is known to exhibit above-average performance or for which
it has not obtained performance statistics yet. This way, the
workflow introduced in Figure 13 could be executed in a mere
25 units of time.
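
The selection rule described above can be sketched in a few lines of Python. The function `pick_task`, the shape of the `observed` statistics, and the fallback to the oldest ready task are assumptions made for illustration; the runtimes in the example are invented and are not those of Figure 13.

```python
from statistics import mean


def pick_task(resource, ready_tasks, observed):
    """Choose the next task category for an idle resource.

    `observed` maps a task category to {resource: mean runtime measured so far}.
    """
    for task in ready_tasks:
        stats = observed.get(task, {})
        if resource not in stats:
            return task  # explore: no statistics for this resource yet
        if stats[resource] <= mean(stats.values()):
            return task  # exploit: at least average speed for this category
    # Assumed fallback: take the oldest ready task so that no resource idles forever.
    return ready_tasks[0] if ready_tasks else None


observed = {
    "variant_calling": {"A": 5, "B": 12, "C": 7},
    "variant_characterization": {"A": 9, "B": 3, "C": 9},
}
ready = ["variant_calling", "variant_characterization"]
print(pick_task("B", ready, observed))  # B is slow at calling, fast at characterization
print(pick_task("A", ready, observed))
```

The trade-off mentioned earlier shows up in the fallback: a resource that is below average for every ready task still has to execute something, otherwise adaptivity would cost more than it gains.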

3. PARALLELISM IN CURRENT SWFMS 

In recent years, a number of concrete SWfMS have been
developed and researchers have begun to adopt scientific
workflows as the model of computation of choice for their analysis
pipelines. This development is reflected by the growth
of public workflow repositories like myExperiment [17] (see
Figure 15). At the same time, ever-increasing quantities
of data generated in scientific experiments have elevated
the demand for parallel execution of scientific workflows.
While growing numbers of cores on servers (see Figure 5)
along with novel computational infrastructures implementing
sharing and leasing of compute resources can provide the
computational backbone for massively parallel computation,
scientific workflow enactment in parallel and/or distributed
infrastructures brings with it a number of design choices,
which were outlined in Section 2.

In this section, we give an overview of concrete support
for parallelism in existing SWfMS. We restrict our analysis
to SWfMS that provide at least some minimal capabilities
for parallel execution and that were actively maintained at
the time of writing. These criteria apply well (but not exclusively)
to the systems Swift, Condor DAGMan, Pegasus,
Taverna, KNIME, and Kepler. Although they are not technically
SWfMS, we also briefly outline massively data parallel
programming models like MapReduce and PACTs. We illustrate
how these systems implement the concepts of parallelism
outlined in Section 2, namely their supported types
of parallelism, adoption of parallel computation infrastructures,
and abilities with regard to scheduling. Table 2 gives
a summary of our findings.

Figure 13: The bioinformatics workflow introduced in Figure 8, executed data parallel as illustrated in
Figure 10. Estimated and actual runtimes (in arbitrary units of time) are given for each category of tasks
(reference alignment, variant calling, variant characterization, and disease-gene association) on each of three
available compute resources A, B, and C. Depending on the underlying hardware configuration (CPU, I/O
throughput, network speed, etc.), compute resources have a different affinity towards certain tasks. In this
example, the compute resources are assumed to exhibit constant performance, though in a real-world scenario
this might not be the case, especially if resources are shared between users. Note that data movement is not
taken into account in this example.

Figure 14: Different scheduling techniques and their effect on the execution trace for the workflow from
Figure 13. Horizontal lanes correspond to the compute resources A, B, and C. The numbers inside the task
nodes match the numbers of tasks introduced in Figure
(a) A schedule obtained via static scheduling
using the HEFT heuristic without incorporating data transfer times. (b) The execution trace generated when
the HEFT schedule is strictly adhered to. (c) The execution trace if scheduling is performed using a simple job
queue. (d) The execution trace for a form of adaptive scheduling, in which tasks are scheduled on resources
on which they have performed above-average in the past.

A categorization of SWfMS with regard to their supported types of parallelism, computational infrastructure, and scheduling
techniques. Multithreading capabilities indicate support for parallel execution on both local multicore machines and clusters.

Figure 15: Number of Taverna workflows in Scufl
and T2flow format uploaded to myExperiment per
month as well as the total number of Taverna workflows
(overall and distinct) in myExperiment (image
taken from [57]).

3.1 Textual workflow languages 

This class of SWfMS encompasses textual languages which 
facilitate the enactment of large workflows in distributed 
compute infrastructures. In contrast to batch scripts, these 
languages provide support for highly parallel computation. 
As they were mostly designed to run on heterogeneous, geo- 
graphically distributed compute resources, considerable ad- 
ministration effort is required for installation and utilization. 
Hence, the focus of these systems lies less on ease of use and 
more on efficient computing of high workloads. As in a
batch script, workflows are specified in a text file (e.g., in XML)
according to a proprietary notation defined by the SWfMS.

3.1.1 Swift 

Among the pioneers of parallel workflow execution is the
Swift parallel scripting language [16]. Swift provides a functional
language in which workflows are modeled as a set of
program invocations with their associated command-line ar- 
guments as well as input and output files. Swift scripts are 
oblivious to the runtime environment and can be executed 
on local multicore computers, clusters, grids, or clouds. The 
execution engine of Swift supports task parallelism, as a task 
is dispatched for execution in a distributed environment as 
soon as all its input parameters are available. As data depen- 
dencies are resolved, the workflow is expanded dynamically 
at runtime. 



For each known compute resource, the Swift execution en- 
gine maintains a score which increases with each successful 
task execution and decreases with each failed attempt [56] . 
Tasks are assigned reactively to compute resources at run- 
time and the higher the score of a resource the more tasks 
will be assigned to it. Hence, changes in the performance 
of compute nodes are reflected in the score and taken into 
account by the scheduler. While scheduling reacts to alterations
in the workload of a resource, it cannot utilize elasticity
in the form of provisioning of additional compute resources 
at runtime. Swift implements capabilities for failure recov- 
ery (retry, restart, and replication) as well as provenance 
tracking. 
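
The score-based mechanism can be illustrated with a small Python sketch. The class `ScoreBoard`, the step size, the lower bound on scores, and the proportional split of tasks are all assumptions chosen for illustration; the source does not specify the formula actually used by the Swift execution engine.

```python
class ScoreBoard:
    """Per-resource scores that rise on success and fall on failure (illustrative only)."""

    def __init__(self, resources, initial=1.0):
        self.scores = {r: initial for r in resources}

    def report(self, resource, success, step=0.5, floor=0.1):
        """Update a score after a task attempt; `step` and `floor` are assumed values."""
        delta = step if success else -step
        self.scores[resource] = max(floor, self.scores[resource] + delta)

    def shares(self, n_tasks):
        """Split n_tasks among resources in proportion to their current scores."""
        total = sum(self.scores.values())
        return {r: n_tasks * s / total for r, s in self.scores.items()}


board = ScoreBoard(["A", "B"])
board.report("A", success=True)
board.report("A", success=True)
board.report("B", success=False)
print(board.scores)
print(board.shares(10))
```

The lower bound keeps a resource with a history of failures eligible for an occasional task, so that its score can recover once its performance improves.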

In summary, Swift implements adaptive scheduling and task 
parallelism on arbitrary compositions of local machines and 
clusters as well as grid and cloud infrastructures, as shown 
in Table 2. Swift has been utilized for computationally intensive
applications from various fields of science, including
physics and the life sciences. For an overview of compu- 
tationally intensive scientific applications implemented in a 
SWfMS discussed in this survey, see Table 3.

3.1.2 Condor DAGMan 

Condor [29] is a batch job scheduler for high-throughput 
computing on distributed resources. It was originally de- 
signed to scavenge idle workstations for CPU cycles, yet has 
been extended with functionality to interface grid resources 
via Condor-G. Condor puts a strong emphasis on reliability 
of execution in the form of job checkpointing, recovery, and 
migration. Condor provides a resource allocation language 
called ClassAds [58] ("classified advertisements") to describe 
requirements and preferences for matchings between tasks 
and compute nodes. For instance, ClassAds can be utilized 
to describe that a machine only accepts tasks if its current 
workload is low and it has been idle for an hour or that a 
task prefers to be executed on a machine with good float- 
ing point performance. Condor maintains a queue of tasks 
in which new tasks are scheduled on the compute resource 
with the best matching ClassAd. Scheduling in Condor can 
therefore be considered adaptive, provided the ClassAds are 
specified such that machines with a high workload are less 
likely to be assigned new tasks. 
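
The matchmaking idea behind ClassAds can be paraphrased in Python as follows. This is not ClassAd syntax and does not use any Condor API; the dictionaries, attribute names, and threshold values are hypothetical and merely mirror the two examples given above (a machine that accepts work only when idle, and a task that prefers good floating point performance).

```python
def accepts_when_idle(machine, job):
    """Machine-side requirement: low current load and idle for at least an hour."""
    return machine["load"] < 0.3 and machine["idle_minutes"] >= 60


def match(job, machines):
    """Return the acceptable machine the job ranks highest, or None if none qualifies."""
    candidates = [
        m for m in machines
        if m["accepts"](m, job) and job["requires"](job, m)
    ]
    return max(candidates, key=job["rank"], default=None)


machines = [
    {"name": "ws1", "load": 0.10, "idle_minutes": 90, "memory_gb": 8,
     "mflops": 500, "accepts": accepts_when_idle},
    {"name": "ws2", "load": 0.05, "idle_minutes": 120, "memory_gb": 16,
     "mflops": 900, "accepts": accepts_when_idle},
    {"name": "ws3", "load": 0.90, "idle_minutes": 0, "memory_gb": 32,
     "mflops": 2000, "accepts": accepts_when_idle},
]
job = {
    "memory_gb": 4,
    "requires": lambda job, m: m["memory_gb"] >= job["memory_gb"],
    "rank": lambda m: m["mflops"],  # preference for floating point performance
}
best = match(job, machines)
print(best["name"] if best else None)
```

Both sides state requirements, and only the job's preference breaks ties among the remaining candidates. The busy machine `ws3` is excluded despite its superior floating point performance, which is what makes the scheme adaptive to load.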

Condor's Directed Acyclic Graph Manager (DAGMan) [30] 
provides the means to textually specify a workflow as a DAG 
describing a set of tasks along with their data interdependencies.
DAGMan operates as a Condor job and supervises
workflow execution, submitting workflow tasks which are 
ready for execution to Condor one at a time. Data paral- 
lelism and pipelining are not natively supported since a task 
is not submitted for execution until all of its input data is 
available. DAGMan does not communicate with the Con- 
dor scheduler, hence advanced workflow characteristics such 
as task runtime estimates are not considered in scheduling. 
Since DAGMan does not manage the movement of interme- 
diate data products between jobs, data movements have to 
be explicitly specified within the workflow DAG. Similar to 
Condor, DAGMan emphasizes reliability. In case of failure, 
DAGMan compiles a rescue DAG from which execution can 
be resumed. 
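
The supervision logic can be sketched with two small Python functions: one that determines which tasks may be submitted next, and one that derives the part of the workflow still to be executed after a failure. Both functions and the three-task example are hypothetical; in particular, the dictionary returned by `rescue_dag` is a simplification and does not reflect the file format of actual rescue DAGs.

```python
def ready_tasks(parents, done, submitted=()):
    """Tasks that are not yet done or submitted and whose parents have all completed."""
    return [
        task for task, deps in parents.items()
        if task not in done
        and task not in submitted
        and all(dep in done for dep in deps)
    ]


def rescue_dag(parents, done):
    """Remaining workflow after a failure: completed tasks and edges from them are dropped."""
    return {
        task: [dep for dep in deps if dep not in done]
        for task, deps in parents.items()
        if task not in done
    }


parents = {"align": [], "call": ["align"], "annotate": ["call"]}
print(ready_tasks(parents, done=set()))      # only the entry task is ready
print(ready_tasks(parents, done={"align"}))  # its child becomes ready
print(rescue_dag(parents, done={"align"}))   # workflow to resume from
```

Since a task only becomes ready once all of its parents have completed, the sketch also shows why pipelining is not possible in this execution model.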

As shown in Table 2, the SWfMS DAGMan supports adaptive
scheduling and parallel task execution on a grid infrastructure.
Since a compute grid might not be available to
some users, efforts have been made to execute DAGMan 
workflows on local compute infrastructure without Condor 
(e.g., [78]). 

3.1.3 Pegasus 

DAGMan does not provide the means to automatically set 
up auxiliary tasks, such as data movement, cleanup, or workflow
optimization. Therefore, Deelman et al. [9] developed
Pegasus as a data-aware layer on top of DAGMan, which 
introduces capabilities for provenance tracking, execution 
monitoring, and failure recovery. The SWfMS Pegasus can 
be viewed as a collection of DAG transformers which itera- 
tively translate a concrete workflow into an executable DAG- 
Man workflow. To this end, Pegasus queries and maintains 
catalogs of available computational resources, data replicas,
and data processing software/services. Pegasus also
prunes and optimizes workflow structure by omitting tasks 
for which output data has been computed previously and by 
clustering short-running tasks into joint DAGMan jobs. 

Tasks in the DAGMan input file generated by Pegasus are 
location-specific, i.e., Pegasus overrides the default scheduling
mechanism of Condor (ClassAds). Instead, Pegasus provides
four different scheduling strategies by default:



• Random: Tasks are randomly assigned to compute resources
able to execute them.

• Round-Robin: Tasks are evenly distributed among resources,
independent of the associated computational
cost.

• Group: Tasks can be put into user-defined groups.
Each group of tasks is scheduled to run on the same
compute resource.

• HEFT: The Heterogeneous Earliest Finish Time
scheduling heuristic [41] described in Section 2.3. Pegasus
assumes default costs for data communication,
whereas the runtime for tasks has to be specified by
the user.
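
The first three strategies in the list above are simple enough to sketch directly. The Python functions below are illustrative only and are not taken from Pegasus: all names are hypothetical, every site is assumed to be able to execute every task, and groups are assigned to sites in round-robin order, which is an assumption of this sketch.

```python
import itertools
import random


def random_mapping(tasks, sites, rng=random):
    """Random: each task is assigned to a randomly chosen site."""
    return {task: rng.choice(sites) for task in tasks}


def round_robin_mapping(tasks, sites):
    """Round-Robin: tasks are dealt out to sites in turn, regardless of cost."""
    cycle = itertools.cycle(sites)
    return {task: next(cycle) for task in tasks}


def group_mapping(task_groups, sites):
    """Group: all tasks sharing a user-defined group run on the same site."""
    cycle = itertools.cycle(sites)
    site_of_group, mapping = {}, {}
    for task, group in task_groups.items():
        if group not in site_of_group:
            site_of_group[group] = next(cycle)
        mapping[task] = site_of_group[group]
    return mapping


print(round_robin_mapping(["t1", "t2", "t3"], ["s1", "s2"]))
print(group_mapping({"t1": "g1", "t2": "g2", "t3": "g1"}, ["s1", "s2"]))
print(random_mapping(["t1", "t2"], ["s1", "s2"], rng=random.Random(0)))
```

None of these mappings consults runtime information, so all of them can be computed before execution starts.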



All these scheduling strategies result in a static schedule. 
Since the workload on the compute infrastructure might 
be subject to change, Lee et al. [53] developed an adap- 
tive scheduling mechanism for Pegasus. Here, job queues 
on execution sites are observed and compared. In case of 
sustained discrepancies, Pegasus re-schedules the workflow 
from the current point of execution. 

Pegasus was originally designed to distribute computationally
intensive workflows across grid infrastructures. However,
the growing interest in cloud computing has led to efforts
to efficiently run Pegasus in a cloud infrastructure [31-33].
In its production version, Pegasus supports task parallel
execution of scientific workflows on grid and cloud infrastructures
using a static schedule (see Table 2). Several
computationally intensive workflows from different scientific
domains have been implemented and executed in Pegasus
(see Table 3).



3.1.4 Other textual workflow languages 

While the aforementioned textual workflow languages can manage
execution of thousands of tasks in parallel, evaluation
of workflow files (e.g., Swift scripts) is performed on a single
node. In light of ever-increasing concurrency in
today's computing systems, this may constitute a bottleneck
for workflows consisting of very large numbers of short-running
tasks. For this reason, Wozniak et al. [79] implemented
Turbine, an extreme-scale workflow management
system with an execution engine distributed over multiple
compute nodes, which is responsible for resolving data dependencies,
managing task distribution and load balancing,
and providing global data storage. Turbine can compile
Swift scripts and has been shown to be able to distribute
more than 20,000 workflow tasks per second.

Islam et al. [80] found many established workflow manage- 
ment systems to lack in scalability. Hence, they developed 
Oozie, a scalable workflow scheduler which is built on top of 
Hadoop [81], the open source implementation of the MapReduce
[25] programming paradigm. The Oozie server accepts
textually specified workflow DAGs submitted by multiple 
Oozie clients, splits these workflows into sub-tasks, and dis- 
patches the sub-tasks to a Hadoop cluster for processing. 
Oozie is highly scalable and has been utilized by Yahoo! for 
execution of more than 770,000 workflows. 



Ogasawara et al. [82] recently presented a relational algebra
that facilitates highly scalable scientific workflow execution.
In their work, data is represented in the form of
relations. Workflow tasks, such as external program or ser- 
vice invocations, are interpreted as one of four operators. 
These operators - Map, SplitMap, Reduce, and Filter - are 
characterized by the ratio at which they consume and pro- 
duce tuples of input and output data. Workflows modeled 
as compositions of these operators can be structurally op- 
timized by applying concepts of database query optimiza- 
tion, such as predicate pushdown. The relational view on 
data enables the scheduler to automatically distribute frag- 
ments of the workflow and parts of the data among several 
processing nodes, resulting in a task and data parallel exe- 
cution. Different scheduling strategies have been shown to 
yield runtime gains, depending on workflow structure and 
its opportunities for structural optimization. 
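
To make the operator view more concrete, the following Python sketch models relations as lists of dictionaries and gives one plausible reading of the four operators in terms of their tuple ratios. The function names, the ratios noted in the comments, and the example data are assumptions of this sketch and are not taken verbatim from [82]. The final lines illustrate predicate pushdown: a filter that only inspects attributes left untouched by a preceding Map can be applied first without changing the result.

```python
def map_op(activity, relation):
    """One output tuple per input tuple."""
    return [activity(t) for t in relation]


def split_map_op(activity, relation):
    """Several output tuples per input tuple."""
    return [out for t in relation for out in activity(t)]


def reduce_op(activity, relation, key):
    """One output tuple per group of input tuples sharing the same key value."""
    groups = {}
    for t in relation:
        groups.setdefault(t[key], []).append(t)
    return [activity(group) for group in groups.values()]


def filter_op(predicate, relation):
    """At most one output tuple per input tuple."""
    return [t for t in relation if predicate(t)]


samples = [{"id": 1, "quality": 30}, {"id": 2, "quality": 10}]


def annotate(t):
    return {**t, "annotated": True}


def good(t):
    return t["quality"] >= 20


late = filter_op(good, map_op(annotate, samples))   # filter after the expensive step
early = map_op(annotate, filter_op(good, samples))  # filter pushed down
print(late == early, early)
```

In the pushed-down variant, `annotate` is invoked for one tuple instead of two, which is the kind of saving that structural optimization aims for when `annotate` stands for an expensive external program.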

Instead of designing workflows ad hoc in a way that allows 
for data parallel execution, de Oliveira et al. [83] proposed
to outsource only the most computationally taxing workflow 
tasks to external resources. Following this rationale, they 
developed SciCumulus, which can be described as a cloud- 
based scientific workflow middleware. Scientists can wrap 
computationally intensive tasks of their existing workflows 
into SciCumulus cloud activities. To this end, SciCumulus 
provides components for upload, dispatch, download, and 
provenance capture. These components can be interfaced 
from within a SWfMS like Taverna, Pegasus, or Kepler. By 
using predefined or custom cartridges, users can specify how 
data is to be fragmented before and merged after process- 
ing for data parallel computation. SciCumulus can also be 
employed to perform parameter sweeps on workflow compo- 
nents in parallel. 

Cieslik and Mura [84] developed PaPy ("Parallel pipelines 
in Python"), a light-weight and modular textual SWfMS 
in which workflow tasks are specified as Python functions. 
PaPy supports task parallelism in the form of a worker pool 
distributed among local and remote resources. Furthermore, 
by allowing input data to be split and processed in chunks, 
PaPy can be configured to perform data parallel computa- 
tion. 

3.1.5 Massively data parallel query languages 

Subsequent to the publication and widespread adoption
of Google's MapReduce [25] programming model and its
open source implementation Hadoop [81], a class of systems
that can be summarized under the term "dataflow languages"
has emerged. These textual languages, including
DryadLINQ [85] from Microsoft, Stratosphere's Meteor [86],
Pig [87] from Yahoo!, Hive [88] from Facebook, and the Asterix
Query Language (AQL) [89] from UC Irvine, were designed
to efficiently execute query-style dataflow programs
over extremely large data in parallel.

Dataflow programs specified in these languages are translated
into DAGs of Dryad vertices, Stratosphere parallelization
contracts (PACTs [90]), Hadoop map and reduce jobs,
or Hyracks [91] operators. Similar to scientific workflows,
these DAGs describe data dependencies between separate
data processing steps. Data parallel execution of these DAGs
is handled by the execution engines underlying the Dryad,
Hadoop, and Hyracks implementations or, in the case of
Stratosphere, the Nephele [52] scheduler. See Figure 16 for
a graphical overview of the stack architecture of the aforementioned
systems.

While all of these dataflow languages provide the means to 
model computationally intensive problems, they were not 
primarily designed for scientific data. Thus, in contrast to 
most SWfMS, they neither provide domain-specific software 
libraries nor are there any published workflows designed by 
domain scientists. Furthermore, while each task in a scien- 
tific workflow is treated as a black box, dataflow languages 
often require the user to specify the task according to a restrictive
query-based syntax, which is typically not easily
accessible to non-computer-savvy users. For these reasons,
while MapReduce and similar frameworks have an undeni- 
able influence on parallelization in SWfMS, they are out of 
scope for this survey. 

However, Dryad [50] and Nephele [52], the execution engines 
underlying the dataflow languages DryadLINQ and Meteor, 
respectively, can both be utilized for execution of arbitrary 
workflows. Similar to DAGMan or Pegasus, Dryad maps 
abstract workflows specified as DAGs to their concrete com- 
pute resources. However, while in Pegasus tasks can only ex- 
change data in the form of files, Dryad also supports network 
or shared memory communication. These features enable 
Dryad to perform a pipeline parallel execution in which tasks 
start execution as soon as at least one input record is avail- 
able at each input port. Dryad tasks are written in C++ and 
tasks can be annotated with preferences or constraints towards
the resources on which they are to be executed. This
puts data-awareness into the hands of the user, since the
default scheduling strategy of Dryad is a job queue in which 
new tasks are greedily assigned to the first available resource. 

The Dryad framework was developed with static computation
environments in mind. In contrast, Nephele [52] also
supports distribution of tasks on dynamically scalable com- 
pute infrastructures, such as compute clouds. Similar to 
Dryad, Nephele supports task communication in the form of 
shared memory and TCP network connections. Users can 
also specify to cluster several short-running tasks on a single 
compute resource or split long-running tasks into subtasks 
for data parallel execution. If no annotations are provided 
by the user, the Nephele scheduler assigns each task to its 
own compute resource. 

In summary, Dryad and Nephele support task, data, and 
pipeline parallelism on multicore architectures and compute 
clouds using a job queue for just-in-time scheduling. 

3.2 Graphical workflow management systems 

The class of graphical SWfMS comprises systems with a 
strong emphasis on ease of use and graphical representa- 
tion of workflows. Since graphical representation becomes 
problematic for large workflows consisting of hundreds or 
thousands of tasks 1 , most graphical SWfMS provide a 
means to compose workflows from a hierarchical nesting of 
subworkflows. In contrast to textual workflow scripting languages,
workflows are mostly designed from collections of
web-services or built-in components which can be adjusted 
by various parameters. 

Figure 16: Stack architecture of the dataflow languages DryadLINQ, Meteor, Pig, Hive, and AQL.



Graphical SWfMS are typically installed either on a remote 
server or on a local client and accessed using a graphical 
user interface with drag-and-drop functionality on the client. 
Hence, workflow execution is performed only on a single ma- 
chine. 

Graphical systems sometimes support multithreading, but 
most of them cannot utilize external compute resources by
default, save via integration of web services. However, ap- 
proaches towards execution on distributed computational in- 
frastructures are increasingly being explored. For schedul- 
ing, tasks with available input data are usually put into a 
job queue and executed by replicate workers from a thread 
pool. 

3.2.1 Taverna 

Taverna Workbench [7] is an established graphical SWfMS 
developed for the enactment of bioinformatics workflows. It 
emphasizes usability, providing a graphical user interface 
for workflow modeling and monitoring as well as a comprehensive
collection of pre-defined services. Taverna workflows
are internally represented in one of two textual
languages: Scufl (used by Taverna 1) and T2flow (used
by Taverna 2). User-generated workflows can be exchanged
through the myExperiment workflow repository [17]. See
Figure 2 for a bioinformatics Taverna workflow from myExperiment.

Taverna workflows can be executed either on the client or 
on a server. As of yet, Taverna can utilize only a single 
machine, as there is no support for execution of compu- 
tationally intensive tasks on more than one server or on 
distributed architectures, such as grid or cloud infrastruc- 
tures. 

In Taverna, a task can start as soon as input data is available 
at all of its input ports. Each task is processed by a separate 
thread, with the maximum number of concurrently running 
threads being set by the user. If a task receives a list of 
data items on an input port where a single item is expected, 
each element of the list is processed by a replicate of the 
task in a separate thread [51]. Each processed data item is 
passed to following tasks for immediate consumption in a 
new thread. Taverna's implicitly pipeline parallel execution 
model results in multiple replicate data processing pipelines 
running concurrently. 
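
The net effect of this execution model, several replicate pipelines running side by side, can be approximated with a thread pool in Python. The sketch below is not Taverna code: the function `run_pipelines` is hypothetical, and it simplifies the model by keeping each list element in one thread for its entire pipeline instead of starting a new thread at every task.

```python
from concurrent.futures import ThreadPoolExecutor


def run_pipelines(items, stages, max_threads=4):
    """Send every list element through all stages; elements are processed concurrently."""

    def pipeline(item):
        for stage in stages:
            item = stage(item)  # pass the result on as soon as it is available
        return item

    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # map() returns results in input order, whatever the completion order.
        return list(pool.map(pipeline, items))


def add_one(x):
    return x + 1


def double(x):
    return x * 2


print(run_pipelines([1, 2, 3], [add_one, double]))
```

The parameter `max_threads` plays the role of the user-defined limit on concurrently running threads mentioned above.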



In summary, while Taverna provides strong support for
parallelized workflow enactment in the form of task, data,
and pipeline parallelism, it offers no scheduling method more
sophisticated than a job queue and can currently utilize only
the cores of a single local resource (see Table [5]). Several
computationally intensive problems from the field of
bioinformatics have been implemented in Taverna, as shown in
the accompanying table.



3.2.2 KNIME 

The Konstanz Information Miner (KNIME) [10] shares many
characteristics with Taverna, but places a stronger focus on
user interaction and visualization of results and less emphasis
on web service invocation. Furthermore, KNIME focuses on
workflows from the fields of data mining, machine learning, and
chemistry, while Taverna is more concerned with the integration
of distributed and possibly heterogeneous data. A graphical user
interface facilitates the design and execution monitoring of
workflows. KNIME can be installed either locally or on a server
that is accessible to multiple clients.

As in Taverna, each task ready for execution is put into a
queue, from which it is retrieved and processed by a separate
thread. The size of the thread pool is restricted via
user-defined constraints. While Taverna detects opportunities
for data parallelism by monitoring the input and output ports
of a task, KNIME requires the designer of a task node to
explicitly specify whether it qualifies for data parallel
execution. If a node is implemented accordingly, KNIME
automatically splits the entire input data into four times as
many chunks as there are threads in the pool. Similar to data
parallelism in the map tasks of the MapReduce paradigm, each
chunk is then processed by a replicate task, and the results
are merged as soon as all threads have finished execution [59].
Note that, in contrast to Taverna, KNIME does not support
pipeline parallelism, as a task cannot start execution until
all of its parent tasks have processed all of their data.
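
The chunking strategy described above can be illustrated with a short Python sketch. It is not KNIME code: `split_into_chunks` and `data_parallel_node` are hypothetical names, and the near-equal contiguous split is an assumption, since the text only states that the number of chunks is four times the thread pool size.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(rows, n_chunks):
    """Split rows into at most n_chunks contiguous, near-equal chunks."""
    n_chunks = max(1, min(n_chunks, len(rows)))
    size, extra = divmod(len(rows), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(rows[start:end])
        start = end
    return chunks


def data_parallel_node(rows, row_function, pool_size=2):
    """Run row_function over all rows using 4 * pool_size chunks."""
    chunks = split_into_chunks(rows, 4 * pool_size)

    def process_chunk(chunk):
        return [row_function(row) for row in chunk]

    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        partial_results = list(pool.map(process_chunk, chunks))

    # Merge only after every chunk has finished (no pipelining).
    merged = []
    for part in partial_results:
        merged.extend(part)
    return merged


squares = data_parallel_node(list(range(20)), lambda x: x * x, pool_size=2)
```

Creating more chunks than threads lets a thread that finishes early pick up another chunk, which evens out the load when chunks differ in processing time.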

As shown in Table [2], KNIME implements job queue scheduling
as well as task and data parallelism in the form of multiple
threads on a local machine or on a remote server. Distributed
architectures such as compute grids or clouds are not
supported by default.
Figure 17: Guideline for choosing a director based
on a given Kepler workflow [60].



3.2.3 Kepler 

Kepler [8] is a frequently used graphical SWfMS. Similar to
Taverna and KNIME, it provides an assortment of built-in
components, with a major focus on statistical analysis. Neither
a server-based implementation of Kepler nor support for
distributed execution is currently provided. Kepler is built on
top of the Ptolemy II Java library [92], from which it inherits
the concept of the so-called director. By choosing a director,
users can specify how a workflow is processed, i.e., which
scheduling technique should be employed. See Figure [17] for a
guideline on which director to choose for a given Kepler
workflow. Default options include the synchronous dataflow
director (SDF), the dynamic dataflow director (DDF), and the
process network director (PN) [60].

SDF and DDF directors execute the whole workflow in a sin- 
gle thread. In contrast, the PN director assigns a separate 
thread to each workflow task, implementing a pipelined exe- 
cution on local Java threads. Data items are passed from the 
output port of parent tasks to the input port of child tasks, 
where they are stored in a buffer. The PN director super- 
vises workflow execution by monitoring buffer sizes and in- 
structing threads to process new data items when available. 
The choice of director also determines whether scheduling is
performed statically prior to execution (SDF director) or at
runtime (DDF and PN directors).
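
The thread-per-task, buffered execution of the PN director can be approximated by the following Python sketch. It does not use Kepler or Ptolemy II; `actor` and `run_process_network` are hypothetical names, and bounded blocking queues stand in for the buffer supervision performed by the director.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of a data stream


def actor(function, inbox, outbox):
    """Run one workflow task in its own thread, item by item."""
    while True:
        item = inbox.get()          # blocks until a data item is buffered
        if item is _DONE:
            outbox.put(_DONE)       # propagate end-of-stream downstream
            return
        outbox.put(function(item))  # blocks while the next buffer is full


def run_process_network(items, functions, buffer_size=2):
    """Chain tasks through bounded buffers, one thread per task."""
    buffers = [queue.Queue(maxsize=buffer_size)
               for _ in range(len(functions) + 1)]
    threads = [
        threading.Thread(target=actor, args=(f, buffers[i], buffers[i + 1]))
        for i, f in enumerate(functions)
    ]
    for t in threads:
        t.start()

    def feed():
        for item in items:
            buffers[0].put(item)
        buffers[0].put(_DONE)

    feeder = threading.Thread(target=feed)
    feeder.start()

    results = []
    while True:
        item = buffers[-1].get()
        if item is _DONE:
            break
        results.append(item)

    feeder.join()
    for t in threads:
        t.join()
    return results


out = run_process_network(range(5), [lambda x: x + 1, lambda x: x * 10])
```

Since each task owns exactly one thread and the buffers are first-in, first-out, items leave the network in the order in which they entered it, while different tasks work on different items at the same time.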

In an attempt to enable data parallel execution of
data-intensive Kepler workflows, Wang et al. [93] presented map
and reduce actors based on the parallel computing framework
Hadoop. By integrating MapReduce into Kepler, workflow designers
can benefit from its parallel programming model without having
to worry about the programming interfaces. However, a generic
word count problem implemented in Kepler using this MapReduce
component showed clearly inferior performance when compared to
the equivalent Java Hadoop implementation [93]. This is most
likely the consequence of the Kepler engine having to be
initiated separately and repeatedly for each of the map and
reduce tasks.
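
For readers unfamiliar with the word count benchmark, its map and reduce steps can be sketched in plain Python. The sketch runs sequentially and uses neither Hadoop nor the Kepler actors of [93]; the function names are hypothetical and only illustrate the programming model.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all emitted values by key, as the framework would do."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reduce: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle, and reduce one after another."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())


counts = word_count(["to be or not", "to be"])
```

In a MapReduce framework, the calls to `map_phase` and `reduce_phase` are independent of one another and can therefore be distributed across machines.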

To summarize, by default Kepler supports task and pipeline 
parallel execution of workflows in possibly multiple threads 
on a local client (see Table [5]). Scheduling is performed
statically in advance or just-in-time using a job queue. Several
reports of scientific projects using the Kepler SWfMS have
been published, as shown in Table [3].

Table [4] compares Taverna, KNIME, and Kepler and thereby serves
as a guideline for choosing among them.

3.3 Domain-specific web portals 

This class of SWfMS comprises systems that facilitate the
composition of domain-specific data repositories, web services,
and applications. A strong emphasis is usually put on easy
setup, intuitive operation, and sharing of workflows. In most
cases, workflows are designed in a web browser and executed on
a public or private server. In this survey, we limit our
examination of domain-specific web portals to SWfMS specific to
the life sciences. The rationale is that many SWfMS have been
tailored specifically to the life sciences and a considerable
number of computational tasks in bioinformatics qualify for
parallel execution.

With the advent of next generation sequencing and the gen- 
eral exponential rise of data in bioinformatics, Galaxy [13] 
has been established as one of the major frameworks for ge- 
nomic research in the life sciences. Galaxy comes with a 
web-based user interface as well as assorted pre-built com- 
ponents for popular tasks in sequence analysis. Workflows 
can be assembled from tasks and data repositories, shared 
with other users and executed on a public or private server. 
See Figure [3] for a Galaxy workflow from the public Galaxy 
server. Afgan et al. [94] recently presented CloudMan, a
scalable resource system for deploying Galaxy servers on the
Amazon EC2 cloud. Note that CloudMan does not distribute the
tasks or data of an individual workflow among virtual machines;
it merely allows multiple workflows to be run on different
virtual machines.

The lack of interoperability between the two major SWfMS 
in the field of bioinformatics, Galaxy and Taverna, severely 
hampers workflow and knowledge sharing within the sci- 
entific community. Therefore, Abouelhoda et al. [95] de- 
veloped Tavaxy, a stand-alone SWfMS which can integrate 
both Taverna and Galaxy workflows at design-time as well as 
at run-time. Tavaxy workflows are specified in tScufl, which 
is highly inspired by Taverna's Scufl language, whereas the 
workflow execution engine underlying Tavaxy can be consid- 
ered an extension of the Galaxy engine. Publicly available 
VM images of Tavaxy provide a means to execute workflow 
tasks, subworkflows or entire workflows on rented Amazon 
EC2 cloud infrastructure. 



Table 4: Comparison of graphical SWfMS
[Table body not recoverable. Surviving cell entries: "enabled by default"; "only selected components"; "if PN director is selected".]



Linke et al. [15] observed that most bioinformatics workflows
designed in Taverna and Galaxy are dedicated largely to data
conversion, which often results in the loss of complementary
data. Furthermore, with high-throughput experiments producing
ever more data, processing via remote web services is often not
feasible. Therefore, they presented Conveyor, a
client/server-based bioinformatics workflow engine with a
strongly typed hierarchy of data types and an increased focus
on local execution. Each task of a Conveyor workflow is executed
in a separate thread, resulting in task parallel execution.

The typical biologist interested in a bioinformatics analy- 
sis often has to consult several online data repositories and 
combine various analytic tools manually. To facilitate this 
process, Neron et al. [14] developed Mobyle, a framework
accessed from within a web browser that federates local 
and remote bioinformatics services and simplifies their com- 
position. Administrators can register new services on any 
Mobyle server and local clients can access the combined ser- 
vices provided by several of these servers. Parallel execution 
of tasks on different servers is handled by Mobyle's internal
workflow engine.

In bioinformatics, progress in high-throughput technologies 
demands continuous development of novel computational 
analysis methods. Kallio et al. [96] found the usage of most 
of these methods to require considerable computational skill. 
Hence, they developed Chipster, an easy-to-use workflow 
management system featuring a large collection of built-in 
data analysis methods as well as interfaces to add newly de- 
veloped methods. Workflow design is performed in a graph- 
ical user interface on a locally installed client, whereas most 
of the computation is conducted on a server. The server 
can be set up to incorporate analysis tools spread across 
several computational nodes, employing the concept of task 
parallelism. 

Several computationally intensive analysis methods in bioin- 
formatics have recently been implemented using the MapRe- 
duce programming paradigm (e.g., sequence alignment using 
CloudBurst [97]). Since installing and utilizing MapReduce- 
based applications on a distributed computational infras- 
tructure might be difficult for domain scientists, Schoenherr 
et al. [98] developed Cloudgene as an extensible execution 
environment for MapReduce programs in bioinformatics fea- 
turing a graphical user interface and a selection of built-in 
components. Cloudgene allows the graphical design and dis- 
tributed execution of analysis pipelines on local clusters or 
public clouds, such as Amazon EC2. 

Another example of employing the MapReduce programming
paradigm for bioinformatics workflows was presented by Wu et
al. [35]. They developed a life science gateway that allows for
data parallel execution of workflows in the cloud based on
Hadoop streaming. Their framework lets users specify
embarrassingly parallel workflows, such as pairwise BLAST
sequence alignment, as a series of map and reduce tasks, with
each task being specified in command-line syntax. Scientists
using this framework can also share data and workflows, similar
to Galaxy and myExperiment.

In light of the overwhelming assortment of bioinformatics
applications and libraries, users might prefer to employ
pre-built workflows recommended by experts instead of
assembling their own data processing pipelines. Following this
rationale, Angiuoli et al. [34] developed the Cloud Virtual
Resource (CloVR), a life science gateway featuring a selection
of four hard-coded workflows covering some of the major tasks
in next generation sequence analysis. These workflows are
packaged in a virtual machine image and are therefore easy to
set up and execute. Computationally taxing BLAST [99] sequence
searches occurring in three of the workflows are split and
distributed among dynamically extendable cloud resources
according to a BLAST runtime estimation. While the CloVR
virtual machine images provide additional pre-installed
frameworks for parallel execution, such as Hadoop, these are
not utilized by any of the default workflows.
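
How a runtime estimate can guide the splitting of sequence searches is sketched below. The source does not describe how CloVR estimates BLAST runtime or partitions the queries, so everything here is an assumption for illustration: `estimate_runtime` takes runtime to be proportional to query length, and `partition_queries` uses a greedy longest-job-first heuristic.

```python
import heapq


def estimate_runtime(sequence):
    """Hypothetical estimate: runtime grows linearly with query length."""
    return len(sequence)


def partition_queries(sequences, n_nodes):
    """Assign queries to nodes, longest estimated job first.

    Each query goes to the node with the smallest estimated load so far.
    """
    heap = [(0, node) for node in range(n_nodes)]  # (load, node id)
    heapq.heapify(heap)
    assignment = {node: [] for node in range(n_nodes)}
    for seq in sorted(sequences, key=estimate_runtime, reverse=True):
        load, node = heapq.heappop(heap)
        assignment[node].append(seq)
        heapq.heappush(heap, (load + estimate_runtime(seq), node))
    return assignment


queries = ["ACGT" * 5, "ACGT" * 3, "ACGT" * 2, "ACGT" * 2, "ACGT"]
plan = partition_queries(queries, n_nodes=2)
loads = {node: sum(map(estimate_runtime, seqs)) for node, seqs in plan.items()}
```

With a dynamically extendable pool of cloud resources, the same estimate could also inform how many nodes to request, a step this sketch leaves out.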



4. CONCLUSION AND DISCUSSION 

Scientific workflows have recently emerged as a model of
computation for the processing of scientific data. However,
increasing amounts of data as well as a growing interest of
the scientific community in data-driven research have led to
increasingly high requirements for computational power and thus
a growing demand for parallelization techniques.

In Section [2], we outlined three major aspects of parallelism
in SWfMS: types of parallelism, distributed compute
infrastructures, and approaches towards scheduling. With regard
to these concepts, we observed three classes of established
SWfMS in Section [3]: (1) textual workflow languages, which can
distribute workflow tasks over external resources, but are
difficult for domain scientists to set up and often lack
support for data and pipeline parallelism (Swift, DAGMan,
Pegasus); (2) graphical standalone systems, which are easy to
use but are not able to efficiently integrate external
resources (Taverna, KNIME, Kepler); (3) life science enactment
portals, in which domain scientists can design workflows in a
web browser as well as execute and share their workflows on a
remote and possibly public server (Galaxy, Mobyle, Conveyor).
We argue that all of these approaches leave considerable room
for improvement:



• Types of parallelism: While textual workflow languages
provide the capabilities to distribute workflow tasks over
external compute resources, they often do not support data or
pipeline parallelism. Although not all workflows qualify for
data parallel execution, many do, and hence much potential for
highly scalable scientific workflow enactment remains untapped
in textual languages. At the same time, graphical standalone
systems like Taverna implement data and pipeline parallelism,
yet cannot schedule threads on external resources. We believe
that the computational model of scientific workflows could
benefit greatly if both classes of systems adopted concepts
from one another.

• Parallel compute infrastructures: Over the last few decades,
the number of cores per CPU and CPUs per cluster has been
continuously rising. In more recent years, cloud computing
technology has reached the stage of productivity, providing
highly scalable compute resources on demand. Renting
computational platforms as a service seems especially
beneficial for those research applications where
high-performance compute resources are required only
infrequently, namely whenever new experimental data has been
produced. While most SWfMS have been adapted to support
multithreading on multicore architectures, few SWfMS are able
to incorporate grid and cloud resources and none currently
offer advanced features like autonomous (de-)allocation of
machines at runtime. We argue that scientific workflow
technology requires new models of execution inherent to
distributed infrastructures in order to stay competitive.

• Scheduling: The physical infrastructure underlying
distributed compute environments like grids and clouds is often
shared between multiple users. As users execute programs with
different hardware requirements (e.g., I/O-bound vs.
CPU-bound), the performance of compute nodes can change
dynamically at runtime. Scheduling of scientific workflows is
therefore best approached adaptively, by finding suitable
matches between workflow tasks and compute nodes based on
statistics obtained at runtime. Although advanced scheduling
techniques have been shown to yield considerable runtime
improvements, we observed that most SWfMS rely on very basic
scheduling techniques such as greedy job queues or static a
priori assignments of tasks to compute nodes. Introducing
support for multiple types of parallelism and computation on
distributed infrastructures in SWfMS that currently do not
support these concepts translates into more scheduling options
and will therefore further increase the importance of
scheduling.

• Ease of use: We observed that SWfMS either focus on
usability (as seen in most graphical SWfMS and life science
portals) or provide support for distributed computation (as
found in most textual workflow languages). We argue that this
gap between ease of use for non-computer-savvy users and highly
scalable execution has to be narrowed in order for the
scientific community to adopt scientific workflows as the model
of computation of choice for their analysis pipelines.



• Structural workflow optimization: Scientific workflows 
like Montage [39] consist of a large number of short- 
running tasks. In distributed infrastructures, these 
short-running tasks can introduce considerable run- 
time overhead as a result of latency due to network 
connectivity and workflow engine initialization times. 
Automatic clustering of short-running tasks into com- 
posite tasks as a means of workflow optimization has 
been shown to yield promising results. Furthermore, 
many scientific workflows involve selection, filtering, 
and sorting steps on large collections of data, similar 
to database queries. Approaches towards structural 
workflow optimization inspired by established database 
query optimization techniques can constitute a valu- 
able addition to parallel execution with the aim to 
reduce workflow runtime (e.g., [100] ). Unfortunately, 
support for structural workflow optimization in con- 
crete SWfMS is very limited and can be improved 
upon. 
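
The clustering of short-running tasks mentioned under structural workflow optimization can be illustrated with a small Python sketch. It assumes independent tasks with known runtime estimates and a fixed per-job overhead; `cluster_tasks` and `total_overhead` are hypothetical names and do not correspond to any particular SWfMS.

```python
def cluster_tasks(tasks, target_runtime):
    """Bundle independent short tasks into composite tasks.

    tasks: list of (name, estimated_runtime_seconds) pairs.
    A bundle is closed as soon as its total estimated runtime reaches
    target_runtime, so per-job overhead is paid once per bundle.
    """
    bundles, current, total = [], [], 0.0
    for name, runtime in tasks:
        current.append(name)
        total += runtime
        if total >= target_runtime:
            bundles.append(current)
            current, total = [], 0.0
    if current:
        bundles.append(current)  # leftover tasks form a final, smaller bundle
    return bundles


def total_overhead(n_jobs, overhead_per_job):
    """Overhead (latency, engine start-up) is paid once per submitted job."""
    return n_jobs * overhead_per_job


tasks = [(f"t{i}", 2.0) for i in range(10)]
bundles = cluster_tasks(tasks, target_runtime=6.0)
saved = total_overhead(len(tasks), 5.0) - total_overhead(len(bundles), 5.0)
```

In this example, ten two-second tasks are grouped into four composite tasks, so the assumed five-second overhead is paid four times instead of ten. The trade-off is reduced parallelism within each bundle, which is why the target runtime has to be chosen with the available resources in mind.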



In closing, while substantial progress has been made in par- 
allel scientific workflow enactment, the field of solutions is 
still heterogeneous and leaves room for improvement. The 
proliferation of cloud technologies will change the computa- 
tional landscape of data-driven research. This will inevitably 
require SWfMS technology to adapt. 

In particular, the following open research topics arise from 
the points discussed above: 



1. Extend current SWfMS with easy-to-set-up integration of
local, grid, and cloud infrastructures as well as arbitrary
compositions thereof.

2. Investigate approaches towards adaptive scheduling in 
heterogeneous, dynamically changing computational en- 
vironments, such as shared or composite infrastruc- 
tures. 

3. Refine SWfMS to be easy to use for domain scientists while
providing inherent data parallel execution for computationally
expensive tasks on massive data.

4. Explore new means of structural workflow optimiza- 
tion inspired by traditional database query optimiza- 
tion. 

5. Examine possibilities towards storing and searching 
workflows along with their execution traces in public 
repositories in order to reduce redundancy in design 
and execution of workflows. 